Piecewise Holistic Autotuning of Compiler and Runtime Parameters
نویسندگان
چکیده
Current architecture complexity requires fine tuning of compiler and runtime parameters to achieve full potential performance. Autotuning substantially improves default parameters in many scenarios but it is a costly process requiring a long iterative evaluation. We propose an automatic piecewise autotuner based on CERE (Codelet Extractor and REplayer). CERE decomposes applications into small pieces called codelets: each codelet maps to a loop or to an OpenMP parallel region and can be replayed as a standalone program. Codelet autotuning achieves better speedups at a lower tuning cost. By grouping codelet invocations with the same performance behavior, CERE reduces the number of loops or OpenMP regions to be evaluated. Moreover unlike whole-program tuning, CERE customizes the set of best parameters for each specific OpenMP region or loop. We demonstrate CERE tuning of compiler optimizations, number of threads and thread affinity on a NUMA architecture. On average over the NAS 3.0 benchmarks, we achieve a speedup of 1.08× after tuning. Tuning a single codelet is 13× cheaper than whole-program evaluation and estimates the tuning impact on the original region with a 94.6% accuracy. On a Reverse Time Migration (RTM) proto-application we achieve a 1.11× speedup with a 200× cheaper exploration.
منابع مشابه
Piecewise holistic autotuning of parallel programs with CERE
Current architecture complexity requires fine tuning of compiler and runtime parameters to achieve best performance. Autotuning substantially improves default parameters in many scenarios but it is a costly process requiring long iterative evaluations. We propose an automatic piecewise autotuner based on CERE (Codelet Extractor and REplayer). CERE decomposes applications into small pieces calle...
متن کاملProfiling of Code_Saturne with HPCToolkit and TAU, and autotuning Kernels with Orio
This study has profiled the application Code Saturne, which is part of the PRACE benchmark suite. The profiling has been carried out with the tools HPCtookit and Tuning and Analysis Utilities (TAU) with the target of finding compute kernels suitable for autotuning. Autotuning is regarded as a necessary step in achieving sustainable performance at an Exascale level as Exascale systems most likel...
متن کاملPerformance Counter Monitoring for the Blue Gene / Q Architecture
This quarter’s newsletter features the performance area, which has a broad focus including autotuning, performance modeling, end-to-end performance measurement, and tool integration. We aim to extend and integrate mature, robust performance measurement technologies and develop a comprehensive performance data management framework that can be used by other areas of the SUPER project. End-to-end ...
متن کاملApplication-independent Autotuning for GPUs
Autotuning is an established technique for adjusting performance-critical parameters of applications to their specific run-time environment. In this paper, we investigate the potential of online autotuning for general purpose computation on GPUs. Our application-independent autotuner AtuneRT optimizes GPU-specific parameters such as block size and loop-unrolling degree. We also discuss the pecu...
متن کاملA compiler toolkit for array-based languages targeting CPU/GPU hybrid systems
This paper presents a compiler toolkit that addresses two important emerging challenges: (1) effectively compiling dynamic array-based languages such as MATLAB, Python and R; and (2) effectively utilizing a wide range of rapidly evolving hybrid CPU/GPU architectures. The toolkit provides: a high-level IR specifically designed to express a wide range of arraybased computations and indexing modes...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2016